Structural Parsing of Natural Language Text in Tamil Using Phrase Structure Hybrid Language Model
Abstract
Parsing is important in linguistics and Natural Language Processing (NLP) for understanding the syntax and semantics of a natural language under a grammar. Parsing natural language text is challenging because of problems such as ambiguity and inefficiency, and because the interpretation of text depends on context-based techniques. A probabilistic component is essential for resolving ambiguity in both syntax and semantics, thereby increasing the accuracy and efficiency of the parser. Tamil has inherent features that make parsing more challenging still. To address these problems, a lexicalized and statistical approach is applied to parsing with the aid of a language model. Statistical models mainly focus on the semantics of the language and suit large-vocabulary tasks, whereas structural methods focus on syntax and model small-vocabulary tasks. A trigram-based statistical language model for Tamil with a medium vocabulary of 5000 words has been built. Although statistical parsing performs better through trigram probabilities and a large vocabulary size, it has disadvantages: it focuses on semantics rather than syntax, and it lacks support for free word order and long-term relationships. To overcome these disadvantages, a structural component is incorporated into the statistical language model, which leads to hybrid language models. This paper attempts to build a phrase-structured hybrid language model that resolves the disadvantages mentioned above. In developing the hybrid language model, a new part-of-speech tag set for Tamil with more than 500 tags has been developed, giving wider coverage. A phrase-structured treebank has been developed with 326 Tamil sentences covering more than 5000 words. The hybrid language model has been trained on the phrase-structured treebank using the immediate-head parsing technique.
A lexicalized and statistical parser that employs this hybrid language model and the immediate-head parsing technique gives better results than pure grammar-based and trigram-based models.
Keywords: Hybrid Language Model, Immediate Head Parsing, Lexicalized and Statistical Parsing, Natural Language Processing, Parts of Speech, Probabilistic Context Free Grammar, Tamil Language, Tree Bank.
Manuscript received December 27, 2007. This work was supported in part by Tamil Virtual University, Chennai, India.
Selvam M is Assistant Professor, Department of Information Technology, Kongu Engineering College, Perundurai 638052, Erode, Tamilnadu, India. Phone: +91-4294-226570; mobile: +91-9486655106; fax: +91-4294-220087; e-mail: [email protected].
Natarajan A M is Principal of Kongu Engineering College, Perundurai 638052, Erode, Tamilnadu, India.
Thangarajan R is Assistant Professor, Department of Information Technology, Kongu Engineering College, Perundurai 638052, Erode, Tamilnadu, India; e-mail: [email protected].
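To make the trigram component of the abstract concrete, the following is a minimal sketch of an unsmoothed maximum-likelihood trigram model. It is not the authors' implementation: the function names, the `<s>` and `</s>` boundary markers, and the toy romanized tokens are all assumptions introduced here for illustration, and the abstract does not say which estimator or smoothing scheme the paper uses.

```python
from collections import Counter


def train_trigram_counts(sentences):
    """Count trigrams and their two-word contexts over tokenized sentences."""
    trigrams, contexts = Counter(), Counter()
    for tokens in sentences:
        # Hypothetical boundary markers so the first words have a context.
        padded = ["<s>", "<s>"] + list(tokens) + ["</s>"]
        for i in range(2, len(padded)):
            context = (padded[i - 2], padded[i - 1])
            trigrams[context + (padded[i],)] += 1
            contexts[context] += 1
    return trigrams, contexts


def trigram_prob(trigrams, contexts, w1, w2, w3):
    """Maximum-likelihood estimate of P(w3 | w1, w2); 0.0 for unseen contexts."""
    context_count = contexts[(w1, w2)]
    if context_count == 0:
        return 0.0
    return trigrams[(w1, w2, w3)] / context_count


# Toy corpus of romanized tokens, purely illustrative (not from the paper).
corpus = [
    ["avan", "pustakam", "padiththaan"],
    ["avan", "paadam", "padiththaan"],
]
trigrams, contexts = train_trigram_counts(corpus)

p_seen = trigram_prob(trigrams, contexts, "<s>", "avan", "pustakam")
p_reordered = trigram_prob(trigrams, contexts, "<s>", "<s>", "pustakam")
```

In this sketch a sentence that begins with a different constituent than any training sentence receives zero probability for its first trigram (`p_reordered`), even if the reordering is grammatical. That is one way to see the weakness with free word order that the abstract attributes to purely statistical models.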
Similar resources
Lexicalized and Statistical Parsing of Natural Language Text in Tamil using Hybrid Language Models
Parsing is an important process in Natural Language Processing (NLP) and Computational Linguistics, used to understand the syntax and semantics of natural language (NL) sentences confined to a grammar. A parser is a computational system that processes an input sentence according to the productions of the grammar and builds one or more constituent structures that conform to the grammar...
Structural Parsing of Natural Language Text in Tamil Language Using Dependency Language Model
Parsing is an important process in Natural Language Processing (NLP) and Computational Linguistics, used to understand the syntax and semantics of natural language sentences confined to a grammar. A parser is a computational system that processes an input sentence according to the productions of the grammar and builds one or more constituent structures that conform to the grammar. The...
An efficiency dependency parser using hybrid approach for tamil language
Natural language processing is an active research area across the country. Parsing is one of the most crucial tools in a language analysis system; it aims to predict the structural relationships among the words in a given sentence. Many researchers have already developed language tools, but their accuracy does not yet meet human expectations, so research continues. Machine...
An improved joint model: POS tagging and dependency parsing
Dependency parsing is a form of syntactic parsing that automatically analyzes the dependency structure of natural language sentences, producing a dependency graph for each input sentence. Part-of-speech (POS) tagging is a prerequisite for dependency parsing. Generally, dependency parsers perform POS tagging and dependency parsing in a pipeline. Unfortunately, in pipel...